Note: To view the output together with the code, go to this page on RPubs.

Background

Principal investigator Dr. Vinca Monster of the Grape Program at State U needs me, a poor graduate student, to help her understand the influence of physicochemical properties on wine preferences. Her laboratory has gathered an extensive dataset on Portuguese white varietals.

This report uses exploratory data analysis and linear regression to determine associations between wine properties and preference.

## 
## Please cite as:
##  Hlavac, Marek (2015). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2. http://CRAN.R-project.org/package=stargazer
## Parsed with column specification:
## cols(
##   `fixed acidity` = col_double(),
##   `volatile acidity` = col_double(),
##   `citric acid` = col_double(),
##   `residual sugar` = col_double(),
##   chlorides = col_double(),
##   `free sulfur dioxide` = col_double(),
##   `total sulfur dioxide` = col_double(),
##   density = col_double(),
##   pH = col_double(),
##   sulphates = col_double(),
##   alcohol = col_double(),
##   quality = col_integer()
## )

Exploration

Summary Statistics

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

The summary shows no NA counts for any variable, so there doesn’t seem to be any unaccounted missing data.
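A quick way to confirm this kind of check is to count NA values per column. The sketch below uses a tiny made-up data frame (`wines_demo` is hypothetical and is not the lab's data) with one missing value planted on purpose, so the counts have something to find.

```r
# Hypothetical toy data standing in for the wine data set
wines_demo <- data.frame(
  alcohol = c(9.5, 10.4, NA, 11.4),   # one NA planted for illustration
  pH      = c(3.09, 3.18, 3.28, 3.20),
  quality = c(5L, 6L, 6L, 7L)
)

# Number of missing values in each column
na_counts <- colSums(is.na(wines_demo))
print(na_counts)

# TRUE only if no column contains a missing value
no_missing <- all(na_counts == 0)
print(no_missing)
```

Run against the real data frame, all-zero counts would back up the reading of the summary table.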

Graphs

Density, fixed acidity, and free sulfur dioxide seem like candidates to be logged in order to help with distribution (their slopes are definitely not zero).

Now I’ll try the logged variables. Those transformations don’t seem to help much visually, but I will try my model with and without them logged to check.
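For reference, the logged variables could be created along these lines. This is only a sketch: the data frame `wines_demo` and its values are made up, and I am assuming a natural log (the report does not say which base was used). The new column names follow the ones that appear in the correlation and regression output.

```r
# Hypothetical toy rows; the real analysis would use the full wine data
wines_demo <- data.frame(
  density             = c(0.9917, 0.9937, 0.9961),
  fixed.acidity       = c(6.3, 6.8, 7.3),
  free.sulfur.dioxide = c(23, 34, 46)
)

# Natural-log transforms (assumption: base e), stored as new columns
wines_demo$density.log        <- log(wines_demo$density)
wines_demo$fixed.acidity.log  <- log(wines_demo$fixed.acidity)
wines_demo$free.sulf.diox.log <- log(wines_demo$free.sulfur.dioxide)

print(round(wines_demo, 4))
```

Keeping the original columns alongside the logged ones makes it easy to fit the model both ways.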

Finally, I just want to check correlations to develop a better idea of what I’m putting into my model.

##                           quality     alcohol   chlorides  citric.acid
## quality               1.000000000  0.43557472 -0.20993441 -0.009209091
## alcohol               0.435574715  1.00000000 -0.36018871 -0.075728730
## chlorides            -0.209934411 -0.36018871  1.00000000  0.114364448
## citric.acid          -0.009209091 -0.07572873  0.11436445  1.000000000
## density.log          -0.307723788 -0.78135429  0.25757406  0.149442828
## fixed.acidity.log    -0.109736681 -0.13108514  0.03305563  0.292566248
## free.sulf.diox.log    0.099058582 -0.22409197  0.09168564  0.084276395
## pH                    0.099427246  0.12143210 -0.09043946 -0.163748211
## residual.sugar       -0.097576829 -0.45063122  0.08868454  0.094211624
## sulphates             0.053677877 -0.01743277  0.01676288  0.062330940
## total.sulfur.dioxide -0.174737218 -0.44889210  0.19891030  0.121130798
## volatile.acidity     -0.194722969  0.06771794  0.07051157 -0.149471811
##                      density.log fixed.acidity.log free.sulf.diox.log
## quality              -0.30772379       -0.10973668         0.09905858
## alcohol              -0.78135429       -0.13108514        -0.22409197
## chlorides             0.25757406        0.03305563         0.09168564
## citric.acid           0.14944283        0.29256625         0.08427640
## density.log           1.00000000        0.27695036         0.28317156
## fixed.acidity.log     0.27695036        1.00000000        -0.04534913
## free.sulf.diox.log    0.28317156       -0.04534913         1.00000000
## pH                   -0.09368819       -0.43478921         0.02199554
## residual.sugar        0.83864966        0.10237716         0.30293472
## sulphates             0.07444942       -0.01415546         0.06084248
## total.sulfur.dioxide  0.53044357        0.10259928         0.59619976
## volatile.acidity      0.02661505       -0.02974209        -0.11663198
##                                pH residual.sugar   sulphates
## quality               0.099427246    -0.09757683  0.05367788
## alcohol               0.121432099    -0.45063122 -0.01743277
## chlorides            -0.090439456     0.08868454  0.01676288
## citric.acid          -0.163748211     0.09421162  0.06233094
## density.log          -0.093688189     0.83864966  0.07444942
## fixed.acidity.log    -0.434789207     0.10237716 -0.01415546
## free.sulf.diox.log    0.021995543     0.30293472  0.06084248
## pH                    1.000000000    -0.19413345  0.15595150
## residual.sugar       -0.194133454     1.00000000 -0.02666437
## sulphates             0.155951497    -0.02666437  1.00000000
## total.sulfur.dioxide  0.002320972     0.40143931  0.13456237
## volatile.acidity     -0.031915368     0.06428606 -0.03572815
##                      total.sulfur.dioxide volatile.acidity
## quality                      -0.174737218      -0.19472297
## alcohol                      -0.448892102       0.06771794
## chlorides                     0.198910300       0.07051157
## citric.acid                   0.121130798      -0.14947181
## density.log                   0.530443572       0.02661505
## fixed.acidity.log             0.102599278      -0.02974209
## free.sulf.diox.log            0.596199757      -0.11663198
## pH                            0.002320972      -0.03191537
## residual.sugar                0.401439311       0.06428606
## sulphates                     0.134562367      -0.03572815
## total.sulfur.dioxide          1.000000000       0.08926050
## volatile.acidity              0.089260504       1.00000000

The variable most correlated with quality is alcohol (r=.44), so I will use that as my primary independent variable. Logged density is highly correlated with alcohol (r=-.78) and residual sugar (r=.84), and logged free sulfur dioxide is moderately correlated with total sulfur dioxide (r=.60), so there may be some collinearity issues there.
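Scanning a large correlation matrix by eye is error-prone, so here is a small sketch of how the strongly correlated pairs could be pulled out automatically. The data are simulated (the variable names are borrowed from the report, the values are not), and the 0.7 cutoff is my own arbitrary choice.

```r
set.seed(1)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
# Simulated so that logged density falls as alcohol rises
density.log <- -0.002 * alcohol + rnorm(n, sd = 0.001)
pH <- rnorm(n, mean = 3.2, sd = 0.15)   # unrelated to the other two
demo <- data.frame(alcohol, density.log, pH)

cors <- cor(demo)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cors) > 0.7 & upper.tri(cors), arr.ind = TRUE)
pairs_found <- data.frame(
  var1 = rownames(cors)[high[, "row"]],
  var2 = colnames(cors)[high[, "col"]],
  r    = cors[high]
)
print(pairs_found)
```

On the real data the same filter would be expected to surface the density pairs noted above.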

Regression

Exploring Models

Below is my first regression, with all potential independent variables included.

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density.log + 
##     fixed.acidity.log + free.sulf.diox.log + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4172 -0.5008 -0.0287  0.4585  3.0836 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.249e+00  5.481e-01  -2.279 0.022726 *  
## alcohol               1.976e-01  2.411e-02   8.199 3.07e-16 ***
## chlorides            -4.037e-01  5.383e-01  -0.750 0.453354    
## citric.acid           3.611e-03  9.443e-02   0.038 0.969494    
## density.log          -9.368e+01  1.311e+01  -7.144 1.04e-12 ***
## fixed.acidity.log     3.700e-01  9.982e-02   3.707 0.000212 ***
## free.sulf.diox.log    2.163e-01  1.780e-02  12.155  < 2e-16 ***
## pH                    6.609e-01  1.052e-01   6.285 3.57e-10 ***
## residual.sugar        7.305e-02  7.478e-03   9.768  < 2e-16 ***
## sulphates             6.300e-01  9.898e-02   6.364 2.14e-10 ***
## total.sulfur.dioxide -1.901e-03  3.695e-04  -5.146 2.76e-07 ***
## volatile.acidity     -1.651e+00  1.130e-01 -14.615  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7412 on 4886 degrees of freedom
## Multiple R-squared:  0.3012, Adjusted R-squared:  0.2996 
## F-statistic: 191.4 on 11 and 4886 DF,  p-value: < 2.2e-16

Interpretation: For each 1-unit increase in alcohol (I am guessing 1 percent alcohol content), the quality rating increases by 0.198 points (observed ratings run from 3 to 9, a 7-point range), holding all other variables (different qualities of the wine) constant. This is significant at p<.001.

This interpretation could be extended to any of the other independent variables. For example, a 1-unit increase in chlorides is associated with a .404 decrease in the wine quality rating, all other independent variables held constant; however, with a p-value of .45, this association is not statistically significant.
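To avoid misreading a value off the printed table, the estimate and p-value for a given term can be pulled straight from the model summary. The sketch below uses simulated data (a made-up process where alcohol matters and chlorides do not), so the numbers will not match the wine models.

```r
set.seed(42)
n <- 500
alcohol <- runif(n, 8, 14)
chlorides <- runif(n, 0.01, 0.10)
# Hypothetical data-generating process: only alcohol drives quality
quality <- 3 + 0.3 * alcohol + rnorm(n, sd = 0.7)
demo <- data.frame(quality, alcohol, chlorides)

fit <- lm(quality ~ alcohol + chlorides, data = demo)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))
print(round(coefs, 4))

# Look up one term by name instead of by position
alcohol_slope <- coefs["alcohol", "Estimate"]
alcohol_p     <- coefs["alcohol", "Pr(>|t|)"]
```

Indexing by row and column name keeps each coefficient paired with its own p-value.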

## 
## Call:
## lm(formula = quality ~ alcohol + chlorides + citric.acid + density + 
##     fixed.acidity + free.sulfur.dioxide + pH + residual.sugar + 
##     sulphates + total.sulfur.dioxide + volatile.acidity, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8348 -0.4934 -0.0379  0.4637  3.1143 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.502e+02  1.880e+01   7.987 1.71e-15 ***
## alcohol               1.935e-01  2.422e-02   7.988 1.70e-15 ***
## chlorides            -2.473e-01  5.465e-01  -0.452  0.65097    
## citric.acid           2.209e-02  9.577e-02   0.231  0.81759    
## density              -1.503e+02  1.907e+01  -7.879 4.04e-15 ***
## fixed.acidity         6.552e-02  2.087e-02   3.139  0.00171 ** 
## free.sulfur.dioxide   3.733e-03  8.441e-04   4.422 9.99e-06 ***
## pH                    6.863e-01  1.054e-01   6.513 8.10e-11 ***
## residual.sugar        8.148e-02  7.527e-03  10.825  < 2e-16 ***
## sulphates             6.315e-01  1.004e-01   6.291 3.44e-10 ***
## total.sulfur.dioxide -2.857e-04  3.781e-04  -0.756  0.44979    
## volatile.acidity     -1.863e+00  1.138e-01 -16.373  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7514 on 4886 degrees of freedom
## Multiple R-squared:  0.2819, Adjusted R-squared:  0.2803 
## F-statistic: 174.3 on 11 and 4886 DF,  p-value: < 2.2e-16

This is the same model without density, fixed acidity, and free sulfur dioxide logged. R-squared drops by about .02 (from .301 to .282), so I’ll keep those variables logged.

## 
## Call:
## lm(formula = quality ~ alcohol + density.log + fixed.acidity.log + 
##     free.sulf.diox.log + pH + residual.sugar + sulphates + total.sulfur.dioxide + 
##     volatile.acidity, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4167 -0.5008 -0.0281  0.4551  3.0856 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -1.357701   0.528272  -2.570   0.0102 *  
## alcohol                0.197790   0.023995   8.243  < 2e-16 ***
## density.log          -95.276842  12.898858  -7.386 1.76e-13 ***
## fixed.acidity.log      0.381585   0.097972   3.895 9.96e-05 ***
## free.sulf.diox.log     0.215757   0.017773  12.139  < 2e-16 ***
## pH                     0.673841   0.103283   6.524 7.53e-11 ***
## residual.sugar         0.074140   0.007321  10.126  < 2e-16 ***
## sulphates              0.632422   0.098843   6.398 1.72e-10 ***
## total.sulfur.dioxide  -0.001903   0.000369  -5.158 2.60e-07 ***
## volatile.acidity      -1.658880   0.110856 -14.964  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7411 on 4888 degrees of freedom
## Multiple R-squared:  0.3011, Adjusted R-squared:  0.2998 
## F-statistic: 233.9 on 9 and 4888 DF,  p-value: < 2.2e-16

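The model above (model 2) drops chlorides and citric acid, the two predictors that were not significant in model 1; R-squared is essentially unchanged (.301). Below, I try model 2 without density, recalling that density is highly correlated with two other variables and has a high standard error in the previous models.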

## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity.log + free.sulf.diox.log + 
##     pH + residual.sugar + sulphates + total.sulfur.dioxide + 
##     volatile.acidity, data = White_wines)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3023 -0.5071 -0.0281  0.4480  3.1178 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.1519959  0.4067184   2.832  0.00464 ** 
## alcohol               0.3574313  0.0104813  34.102  < 2e-16 ***
## fixed.acidity.log    -0.1357944  0.0688745  -1.972  0.04871 *  
## free.sulf.diox.log    0.2379961  0.0176120  13.513  < 2e-16 ***
## pH                    0.1981555  0.0811874   2.441  0.01469 *  
## residual.sugar        0.0232764  0.0025007   9.308  < 2e-16 ***
## sulphates             0.4362305  0.0957271   4.557 5.31e-06 ***
## total.sulfur.dioxide -0.0024790  0.0003626  -6.837 9.09e-12 ***
## volatile.acidity     -1.7508316  0.1107562 -15.808  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7451 on 4889 degrees of freedom
## Multiple R-squared:  0.2933, Adjusted R-squared:  0.2921 
## F-statistic: 253.6 on 8 and 4889 DF,  p-value: < 2.2e-16

Interpretation: With density dropped, there are no anomalous standard errors. R-squared decreases slightly (from .301 to .293), but not enough to be practically significant. This seems like a good candidate for the final model, but I’ll do diagnostics below.
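The comparison above rests on eyeballing R-squared. As a sketch of how that could be made more systematic, the code below lines up R-squared and adjusted R-squared for two nested models and adds a partial F test via `anova()`. The F test is my addition rather than something used in this report, and the data are simulated with one deliberately useless predictor (`noise_var`, a hypothetical name).

```r
set.seed(7)
n <- 300
x1 <- rnorm(n)
x2 <- rnorm(n)
noise_var <- rnorm(n)                     # has no effect on y by construction
y <- 1 + 0.5 * x1 - 0.4 * x2 + rnorm(n)
demo <- data.frame(y, x1, x2, noise_var)

full    <- lm(y ~ x1 + x2 + noise_var, data = demo)
reduced <- lm(y ~ x1 + x2, data = demo)

# Side-by-side fit statistics for the two models
r2 <- c(full = summary(full)$r.squared,
        reduced = summary(reduced)$r.squared)
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(rbind(r2, adj_r2), 4))

# Partial F test: does the extra term improve the fit?
print(anova(reduced, full))
```

Plain R-squared can never go up when a term is removed, which is why the adjusted value and the F test are the more informative lines here.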

Diagnostics

Comparison of 3 Regression outputs
==========================================================================================================
                        Dependent variable: quality
                        ----------------------------------------------------------------------------------
                        (1)                         (2)                         (3)
----------------------------------------------------------------------------------------------------------
alcohol                 0.198***                    0.198***                    0.357***
                        (0.024)                     (0.024)                     (0.010)
chlorides               -0.404
                        (0.538)
citric.acid             0.004
                        (0.094)
density.log             -93.679***                  -95.277***
                        (13.113)                    (12.899)
fixed.acidity.log       0.370***                    0.382***                    -0.136**
                        (0.100)                     (0.098)                     (0.069)
free.sulf.diox.log      0.216***                    0.216***                    0.238***
                        (0.018)                     (0.018)                     (0.018)
pH                      0.661***                    0.674***                    0.198**
                        (0.105)                     (0.103)                     (0.081)
residual.sugar          0.073***                    0.074***                    0.023***
                        (0.007)                     (0.007)                     (0.003)
sulphates               0.630***                    0.632***                    0.436***
                        (0.099)                     (0.099)                     (0.096)
total.sulfur.dioxide    -0.002***                   -0.002***                   -0.002***
                        (0.0004)                    (0.0004)                    (0.0004)
volatile.acidity        -1.651***                   -1.659***                   -1.751***
                        (0.113)                     (0.111)                     (0.111)
Constant                -1.249**                    -1.358**                    1.152***
                        (0.548)                     (0.528)                     (0.407)
----------------------------------------------------------------------------------------------------------
Observations            4,898                       4,898                       4,898
R2                      0.301                       0.301                       0.293
Adjusted R2             0.300                       0.300                       0.292
Residual Std. Error     0.741 (df = 4886)           0.741 (df = 4888)           0.745 (df = 4889)
F Statistic             191.408*** (df = 11; 4886)  233.949*** (df = 9; 4888)   253.595*** (df = 8; 4889)
==========================================================================================================
Note:                   *p<0.1; **p<0.05; ***p<0.01

##                      Test stat Pr(>|t|)
## alcohol                  5.270    0.000
## chlorides                1.403    0.161
## citric.acid             -4.424    0.000
## density.log              5.229    0.000
## fixed.acidity.log       -3.103    0.002
## free.sulf.diox.log     -11.193    0.000
## pH                       0.964    0.335
## residual.sugar           2.481    0.013
## sulphates                0.047    0.963
## total.sulfur.dioxide    -8.039    0.000
## volatile.acidity         3.210    0.001
## Tukey test               2.656    0.008

##                      Test stat Pr(>|t|)
## alcohol                  5.203    0.000
## density.log              5.265    0.000
## fixed.acidity.log       -3.020    0.003
## free.sulf.diox.log     -11.170    0.000
## pH                       0.959    0.338
## residual.sugar           2.507    0.012
## sulphates                0.063    0.950
## total.sulfur.dioxide    -8.013    0.000
## volatile.acidity         3.228    0.001
## Tukey test               2.540    0.011

##                      Test stat Pr(>|t|)
## alcohol                  5.269    0.000
## fixed.acidity.log       -3.706    0.000
## free.sulf.diox.log     -10.920    0.000
## pH                       0.232    0.817
## residual.sugar          -1.439    0.150
## sulphates                0.095    0.924
## total.sulfur.dioxide    -7.631    0.000
## volatile.acidity         1.906    0.057
## Tukey test               0.942    0.346

This shows model 3 as a good fit: its Tukey test for nonadditivity is not significant (p=.346), whereas it is for models 1 and 2. Without reducing R-squared much, it seems clear the model can do without density, chlorides, and citric acid. All the remaining independent variables stay statistically significant.

Are there any influential outliers?

Below identifies influential observations.

Below identifies observations with large residuals.

## 3308  446 3811 
##    1    2    3

Below identifies outliers.

##       rstudent unadjusted p-value Bonferonni p
## 3308 -4.449019          8.818e-06      0.04319

Below determines whether any points are highly influential.

NB. If there are points that are a) outliers AND b) highly influential, these have potential to change the inference. You should consider removing them.

To make sense of the plots above, I create an influence plot. Studentized residuals beyond +/-2 can be problematic.

##          StudRes         Hat        CookD
## 446  -4.27799398 0.001869208 3.794668e-03
## 1418 -3.04787028 0.012061124 1.257976e-02
## 1932 -3.56197929 0.006915725 9.793883e-03
## 1952 -0.25148495 0.014763853 1.053232e-04
## 2782 -0.08782465 0.054472921 4.938388e-05
## 3308 -4.44901850 0.003946058 8.679611e-03
## 3811 -4.27258334 0.001088782 2.203041e-03
## 4746 -4.03477332 0.015318536 2.805189e-02

It looks like there are five cases that may be problematic: cases 741, 775, 3308, 3902, and 4746.

Another diagnostic is to test for heteroscedasticity (i.e., whether the variance of the error term is not constant).
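The three columns in the table above (studentized residual, hat value, Cook's distance) can also be computed with base R, which makes it possible to flag cases by rule rather than by reading a plot. The sketch below uses simulated data with one deliberately bad case planted at row 100. The cutoffs (|StudRes| > 2 and Hat > 2p/n) are common rules of thumb that I am assuming here, not thresholds taken from this report.

```r
set.seed(3)
n <- 100
x <- rnorm(n)
y <- 2 + 1.5 * x + rnorm(n, sd = 0.5)
# Plant one hypothetical bad case: extreme x and far off the true line
x[100] <- 6
y[100] <- -5
fit <- lm(y ~ x)

# Same three quantities as the influence plot table
diag_tab <- data.frame(
  StudRes = rstudent(fit),        # externally studentized residuals
  Hat     = hatvalues(fit),       # leverage
  CookD   = cooks.distance(fit)   # overall influence on the fit
)

p <- length(coef(fit))            # number of estimated coefficients
# Cases that are both outlying AND high-leverage
flagged <- diag_tab[abs(diag_tab$StudRes) > 2 & diag_tab$Hat > 2 * p / n, ]
print(flagged)
```

Requiring both conditions mirrors the note above: a case needs to be an outlier and influential before removal is worth considering.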

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 9.170889    Df = 1     p = 0.002458951

It does appear there is significant heteroscedasticity in the model (chi-square=9.17, df=1, p=.002).
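To show what a score test of this kind is doing, here is a base R sketch of a Breusch-Pagan style test with the variance modeled on the fitted values. It is written from the textbook description and run on simulated data whose error spread grows with x, so treat it as an illustration of the idea rather than a reproduction of the output above.

```r
set.seed(11)
n <- 400
x <- runif(n, 1, 10)
# Hypothetical data: the error standard deviation grows with x
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

e <- residuals(fit)
u <- e^2 / mean(e^2)             # squared residuals scaled by RSS / n

# Auxiliary regression of the scaled squared residuals on the fitted values
aux <- lm(u ~ fitted(fit))
reg_ss <- sum((fitted(aux) - mean(u))^2)   # regression sum of squares

chisq   <- reg_ss / 2            # score statistic, 1 df for one variance term
p_value <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(c(chisq = chisq, p = p_value))
```

A small p-value says the squared residuals trend with the fitted values, which is the pattern the test reported for the wine model.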

Below, I try dropping cases 741, 775, 3308, 3902, and 4746, then run an influence plot and test for heteroscedasticity again.

We also want to look for multicollinearity, that is, whether some of our independent variables are highly correlated. We do this by looking at the Variance Inflation Factor (VIF), reported after the refit model and its diagnostics below. A VIF > 4 suggests collinearity.

## 
## Call:
## lm(formula = quality ~ alcohol + fixed.acidity.log + free.sulf.diox.log + 
##     pH + residual.sugar + sulphates + total.sulfur.dioxide + 
##     volatile.acidity, data = Wwines_dropouts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1770 -0.5052 -0.0265  0.4454  2.6473 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.157086   0.404152   2.863  0.00421 ** 
## alcohol               0.360551   0.010415  34.618  < 2e-16 ***
## fixed.acidity.log    -0.145213   0.068516  -2.119  0.03411 *  
## free.sulf.diox.log    0.236185   0.017507  13.491  < 2e-16 ***
## pH                    0.193139   0.080637   2.395  0.01665 *  
## residual.sugar        0.022931   0.002485   9.227  < 2e-16 ***
## sulphates             0.424142   0.095052   4.462 8.29e-06 ***
## total.sulfur.dioxide -0.002328   0.000362  -6.432 1.38e-10 ***
## volatile.acidity     -1.743506   0.110081 -15.838  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7397 on 4884 degrees of freedom
## Multiple R-squared:  0.2973, Adjusted R-squared:  0.2961 
## F-statistic: 258.2 on 8 and 4884 DF,  p-value: < 2.2e-16

##          StudRes         Hat        CookD
## 254  -3.83791590 0.005005846 8.210813e-03
## 446  -4.30207872 0.001891299 3.882777e-03
## 1228 -4.27141241 0.002118194 4.288016e-03
## 1416 -3.11394252 0.012155346 1.323380e-02
## 1930 -3.62686985 0.007001829 1.028028e-02
## 1950 -0.26729449 0.014777809 1.190957e-04
## 2780 -0.07648524 0.054649518 3.758328e-05
## 3808 -4.30490087 0.001089529 2.237896e-03
## 4036 -0.99870812 0.014808194 1.665774e-03
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 7.88333    Df = 1     p = 0.004989254

At this point, I have logged some of the variables, dropped three variables (chlorides, citric acid, and density) without reducing R-squared much, compared 3 models, and dropped a few cases to reduce heteroscedasticity. Unfortunately, there still appear to be some cases skewing the data, and the non-constant variance test remains significant (p=.005). Unless there are multicollinearity issues, I think I will take my model to Dr. Monster to consult on further methods for manipulating the data and adjusting the model appropriately.

But first, I check on multicollinearity.

##              alcohol    fixed.acidity.log   free.sulf.diox.log 
##             1.467500             1.286266             1.694721 
##                   pH       residual.sugar            sulphates 
##             1.324745             1.421177             1.052046 
## total.sulfur.dioxide     volatile.acidity 
##             2.087468             1.098414

No variables seem to be causing any problems: every VIF is below 4, and the largest is 2.09 (total sulfur dioxide).
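For intuition about where these numbers come from, the VIF for a predictor is 1 / (1 - R-squared), where R-squared comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data in which `x2` is built to be collinear with `x1` (all names and values are made up).

```r
set.seed(5)
n <- 300
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # built to be collinear with x1
x3 <- rnorm(n)                        # independent of the others
demo <- data.frame(x1, x2, x3)

# For each predictor: regress it on the rest, then apply 1 / (1 - R^2)
vif_by_hand <- sapply(names(demo), function(v) {
  others <- setdiff(names(demo), v)
  aux <- lm(reformulate(others, response = v), data = demo)
  1 / (1 - summary(aux)$r.squared)
})
print(round(vif_by_hand, 2))
```

In this toy example `x1` and `x2` land well above the cutoff of 4 while `x3` stays near 1, which is the contrast to look for.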

Conclusion

Controlling for the other independent variables, a 1-unit increase in alcohol is associated with a .36 increase in white wine ratings, statistically significant at p<.001 in the final model. To optimize ratings, adjusting the alcohol level higher seems like a good recommendation. Additionally, lower fixed acidity, higher free sulfur dioxide, a higher pH level, more residual sugar, higher sulphate levels, less total sulfur dioxide, and less volatile acidity all would likely contribute to higher ratings, as they are statistically significant predictors. Theoretically, I would imagine wine qualities to hit saturation points where increasing or decreasing certain qualities will no longer have a positive effect on wine ratings. I might discuss with Dr. Monster the possibility of a nonlinear model producing a better fit, or some other method for determining thresholds.
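As one possible starting point for that conversation, a quadratic term is the simplest way to let a predictor's effect level off and turn over. The sketch below is purely illustrative: the data are simulated so that ratings peak near 12 percent alcohol, which is a made-up figure and not a finding from the wine data.

```r
set.seed(9)
n <- 500
alcohol <- runif(n, 8, 14)
# Hypothetical ratings that peak near 12 percent alcohol
quality <- 6.5 - 0.25 * (alcohol - 12)^2 + rnorm(n, sd = 0.5)

# Linear model with an added squared term for alcohol
fit_quad <- lm(quality ~ alcohol + I(alcohol^2))
b <- coef(fit_quad)
print(round(b, 3))

# For a fitted parabola b0 + b1*x + b2*x^2, the turning point is -b1 / (2*b2)
peak <- -b[["alcohol"]] / (2 * b[["I(alcohol^2)"]])
print(round(peak, 2))
```

A negative coefficient on the squared term indicates diminishing returns, and the turning point gives a rough estimate of the threshold.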

The link to my Github account is https://github.com/craigalder. The link to my repository for this assignment is https://github.com/craigalder/N741gapminder1.git.